Language is Not a formal System, Language is glorious Chaos - Chris Manning (cs224n)
“Language is Not a formal System, Language is glorious Chaos” (Chris Manning, CS224N; https://xkcd.com/1576/)
Meaning of a Word
Meaning is the idea that a word, phrase, etc. represents; the idea that a person wants to convey using words, signs, etc.; or the idea that is expressed in a work of writing, art, etc.
Representing Words as Discrete Symbols
In traditional NLP, words are treated as discrete symbols. For example, words such as “hotel,” “conference,” and “motel” get localist representations as one-hot vectors, where exactly one entry is 1 and all others are 0:
- motel = [0, 0, …, 0, 1, 0, 0, 0]
- hotel = [0, 0, 1, 0, …, 0, 0, 0]
What will be the dimension of the vector?
The dimension of each vector equals the number of words in the vocabulary. If a language had only two words, symbols, or signs in its vocabulary, its two one-hot vectors would be [0, 1] and [1, 0].
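To make the dimension rule concrete, here is a minimal sketch in plain Python. The three-word vocabulary, its ordering, and the helper name `one_hot` are assumptions made for illustration; a real vocabulary would contain many thousands of words.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word`, given an ordered vocabulary list."""
    vec = [0] * len(vocab)          # one slot per vocabulary word
    vec[vocab.index(word)] = 1      # switch on the slot for this word
    return vec

# Hypothetical toy vocabulary; the order is arbitrary but must stay fixed.
vocab = ["conference", "hotel", "motel"]

hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)

print(len(hotel))  # 3, the size of the vocabulary
print(hotel)       # [0, 1, 0]
print(motel)       # [0, 0, 1]
```

Each vector's length always matches `len(vocab)`, so adding a word to the vocabulary lengthens every vector by one entry.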
Problems with Words as Discrete Symbols
Treating words as discrete symbols has a serious drawback that is important to understand. Consider the vectors for “motel” and “hotel.” Aren’t they orthogonal vectors?
What are orthogonal vectors?
Orthogonal vectors are vectors that are perpendicular to each other, meaning their dot product is zero. Let’s calculate the dot product of “motel” and “hotel” — it’s zero!
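The calculation can be checked with a short sketch. The three-dimensional vectors below are illustrative stand-ins (the full vectors above are elided), and `dot` is a hypothetical helper; the result is the same for any two distinct one-hot vectors, because their 1s never sit in the same position.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative one-hot vectors over an assumed three-word vocabulary.
motel = [0, 0, 1]
hotel = [0, 1, 0]

print(dot(motel, hotel))  # 0: different words are always orthogonal
print(dot(hotel, hotel))  # 1: a word only matches itself
```

So under one-hot encoding, “motel” is exactly as unrelated to “hotel” as it is to any other word in the vocabulary.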
What if the Dot Product is Zero?
Consider two 2-D vectors A and B. We can represent A and B as:
- A = 2i + 3j
- B = 2i + 2j
(where i and j are unit vectors along the x and y axes)
We can then ask how much of A lies along B. The answer is the component of A along B. So, what is this component?
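One way to compute it is a minimal sketch that assumes the standard scalar projection formula, (A · B) / |B|. The function name `scalar_projection` is a hypothetical helper, not something from the lecture.

```python
import math

def scalar_projection(a, b):
    """Length of the component of `a` along the direction of `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / norm_b

A = (2.0, 3.0)  # A = 2i + 3j
B = (2.0, 2.0)  # B = 2i + 2j

# A . B = 2*2 + 3*2 = 10 and |B| = sqrt(8), so the component is 10 / sqrt(8).
print(scalar_projection(A, B))
```

When the dot product is zero the numerator vanishes, so the component of one vector along the other is zero, which is exactly the situation for “motel” and “hotel.”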